Morphological Analysis and Diacritical Arabic Text Compression

Author

  • Amjad M Daoud
Abstract

Morphological analysis of Arabic words reduces the storage requirements of Arabic dictionaries and enables more efficient encoding of diacritical Arabic text, faster spell checking, and more efficient optical character recognition. Together, these benefits support efficient storage and archival of multilingual digital libraries that include Arabic texts. This paper presents a lossless compression algorithm based on affix analysis that exploits statistical studies of the morphological features of diacritical Arabic. The algorithm decomposes a given Arabic word into its root and its affixes. The affixes (prefix, infix, and suffix) are the redundant elements of the word. The roots are stored in a root dictionary. We also maintain categorized affix dictionaries, together with their valid combinations, and use a list of patterns to validate and generate the morphological forms during encoding and decoding. Because the goal is lossless, reproducible Arabic text, stemming is not an option and noise words (high-frequency words) cannot be filtered out. The resulting root dictionary contains about 8000 three-character roots and 700 four-character roots. We also code the most frequently occurring diacritical bigrams (biliterals) and trigrams (triliterals) with codewords that are unused in the ASCII, ASMO-449, and Unicode standard codes. With the root dictionaries and the proposed coding scheme combined, compression ratios for proper Arabic text compare favorably with those of other unigram, non-diacritical methods.

Keywords: compression, affixes, morphological analysis, dictionary, root, diacritics, lexicon, spelling, archival, digital library.
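The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the dictionaries `ROOTS`, `PREFIXES`, `SUFFIXES`, and `PATTERNS`, the token layout, and the function names are all hypothetical, and the toy data is undiacritized and far smaller than the root dictionary the abstract reports. A word is encoded as four small indices (prefix, pattern, root, suffix) when a valid decomposition exists, and is kept as a literal otherwise, so decoding always reproduces the input exactly.

```python
# Toy dictionaries: hypothetical stand-ins for the root and affix dictionaries.
ROOTS = ["كتب", "درس", "علم"]
PREFIXES = ["", "ال", "و", "وال"]
SUFFIXES = ["", "ون", "ات", "ة"]

# A pattern is a tuple: an int i marks the slot of root consonant i,
# a string is a fixed pattern letter (an infix or pattern prefix).
PATTERNS = [
    (0, 1, 2),
    (0, "ا", 1, 2),
    (0, 1, "ا", 2),
    ("م", 0, 1, "و", 2),
]

ROOT_INDEX = {root: i for i, root in enumerate(ROOTS)}


def apply_pattern(pattern, root):
    """Fill the consonant slots of a pattern with the letters of a root."""
    return "".join(root[p] if isinstance(p, int) else p for p in pattern)


def match_pattern(pattern, stem):
    """Return the root a stem yields under a pattern, or None if it does not fit."""
    if len(pattern) != len(stem):
        return None
    slots = {}
    for p, ch in zip(pattern, stem):
        if isinstance(p, int):
            slots[p] = ch
        elif p != ch:
            return None
    return "".join(slots[i] for i in sorted(slots))


def encode_word(word):
    """Return (prefix, pattern, root, suffix) indices, or the word itself as a literal."""
    for pi, prefix in enumerate(PREFIXES):
        for si, suffix in enumerate(SUFFIXES):
            if len(word) < len(prefix) + len(suffix):
                continue
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            for ti, pattern in enumerate(PATTERNS):
                root = match_pattern(pattern, stem)
                if root in ROOT_INDEX:
                    return (pi, ti, ROOT_INDEX[root], si)
    return word  # no valid decomposition: store the literal word


def decode_token(token):
    """Rebuild the surface word from a token produced by encode_word."""
    if isinstance(token, str):
        return token
    pi, ti, ri, si = token
    return PREFIXES[pi] + apply_pattern(PATTERNS[ti], ROOTS[ri]) + SUFFIXES[si]


words = ["الكتاب", "كاتبون", "مكتوب", "والدارسات", "في"]
tokens = [encode_word(w) for w in words]
restored = [decode_token(t) for t in tokens]
print(tokens)
print(restored == words)
```

Any decomposition the encoder accepts regenerates the same surface word, so the scheme stays lossless even when a word admits several analyses. A real coder would additionally have to handle diacritics and the letter changes that some affixes trigger, which this sketch ignores.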


Similar resources

A Compression Technique for Arabic Dictionaries: The Affix Analysis

Every application involving the automatic processing of natural language faces the problem of dictionary size. In this paper, we propose a dictionary compression algorithm based on an affix analysis of non-diacritical Arabic. It consists of decomposing a word into its basic elements, taking into account the different linguistic transformations that can affect the morphologica...


Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus

In this paper we discuss the enhancement of Arabic passage retrieval for both diacritisized and nondiacritisized text. Most previous work suggested that retrieval start with pre-processing the Arabic text to remove the diacritical marks (short vowels) to unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. However, searching for ...


Machine Generation of Arabic Diacritical Marks

The absence of the vowelization marks from the modern Arabic text represents a major obstacle in machine translation and other text understanding applications. In this paper we present a formulation of the problem of automatic generation of the Arabic diacritic marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model considers the word sequence of unvoweled Arabic text as...


Machine Generation of Arabic

The absence of the vowelization marks from the modern Arabic text represents a major obstacle in machine translation and other text understanding applications. In this paper we present a formulation of the problem of automatic generation of the Arabic diacritical marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model considers the word sequence of unvoweled Arabic text ...


Design of Arabic Diacritical Marks

Diacritical marks play a crucial role in meeting the usability criteria of typographic text, such as homogeneity, clarity, and legibility. Changing the diacritic of a letter in a word can completely change its meaning. The situation is very complicated with multilingual text. Indeed, the problem of design becomes more difficult by the presence of diacritics that come from various scripts...



Publication date: 2010